Clare Heinbaugh

October 25, 2020

The Data

Here is the project statement I submitted to DHS to access the individual and household data.

I am in an agent-based modeling class researching development factors in an LMIC, in this case Eswatini. Our current project is to generate a synthetic population using household and individual data. In our first project we dispersed population based on raster data from WorldPop, and now we want to group our “agents” into households. The purpose of this research is to accurately describe the agents at work in human development.

I then downloaded the Stata datasets for the individual recode and household recode, and read them into R using haven. Included in the DHS sample are 4843 household observations and 4987 individuals. The household data has 4293 variables and the individual data 4217. Here are some notable variables from the household data:
- hv005: weights
- hv009: number of household members
- hv104_01 to hv104_34: sex of household members
- hv105_01 to hv105_34: age of household members
- hv106_01 to hv106_34: education of household members
- hv024: adm1 labels and data

With these variables in mind, I created a new data frame called dhs_vars that is a subset of the columns of dhs_household.

The goal of this project is to generate shell data of the entire population that can be used as input for models to predict for additional variables.

Pivoting Data

The first step is to pivot the data. Before we pivot, each individual is in his/her own column for variables like level of education, age, and gender.

We want to pivot the data so that each individual will be a row with corresponding descriptors like age, person number, household id, gender, etc. We use the following code snippet.

# Use select to easily subset columns and put in a new, revised data frame 
dhs_vars <- dhs_household %>% dplyr::select(hhid, hv005, hv009, hv024, hv104_01:hv104_34, hv105_01:hv105_34, hv106_01:hv106_34)

gender_pivot_sample <- dhs_vars %>%
  gather(key = "person_number", value = "gender", colnames(dhs_vars)[5:38]) %>%
  dplyr::select(hhid:hv024, person_number, gender)

age_pivot_sample <- dhs_vars %>%
  gather(key = "person_number", value = "age", colnames(dhs_vars)[39:72]) %>%
  dplyr::select(hhid:hv024, person_number, age)

edu_pivot_sample <- dhs_vars %>%
  gather(key = "person_number", value = "education", colnames(dhs_vars)[73:106]) %>%
  dplyr::select(hhid:hv024, person_number, education)

swz_pns_sample <- cbind.data.frame(gender_pivot_sample,
                                   age = age_pivot_sample$age,
                                   education = edu_pivot_sample$education) %>%
  filter(!(is.na(gender) & is.na(age) & is.na(education)))

Here is the resulting data frame, swz_pns_sample.
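The wide-to-long reshape can be illustrated outside of R as well. Here is a pure-Python sketch of the same pivot on a toy household record; the column names mirror the DHS naming (hv104_* for gender, hv105_* for age), but the values are invented:

```python
# Toy wide-format household row: one column per member attribute,
# mirroring the DHS naming; all values here are invented.
household = {
    "hhid": "0001", "hv009": 2,
    "hv104_01": 2, "hv104_02": 1,    # gender of members 1 and 2
    "hv105_01": 34, "hv105_02": 12,  # age of members 1 and 2
}

def pivot_members(row, n_members):
    """Emit one dict per household member (long format)."""
    people = []
    for i in range(1, n_members + 1):
        suffix = f"_{i:02d}"
        people.append({
            "hhid": row["hhid"],
            "person_number": i,
            "gender": row.get("hv104" + suffix),
            "age": row.get("hv105" + suffix),
        })
    return people

people = pivot_members(household, household["hv009"])
# Two rows, one per household member
```

The gather/select pipeline above does the same thing columnwise for all 34 member slots at once.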

Locating the Households

Using the raster data from Eswatini in 2006, I distributed households using rpoint across the entire country and produced the following plot.

*Figure 1:* `Rpoint` distributed according to raster

Now, instead of distributing the entire population, we only want to distribute as many households as appear in our dataset within each adm1.

To do this with only our four adm1’s, we can subset all observations by hv024 and then distribute the rpoints randomly according to the underlying raster distribution. Then we can match these random points to actual households.

Key:
- 1 hhohho
- 2 manzini
- 3 shiselweni
- 4 lubombo

To do this we start by getting the number of households in each region according to the DHS household data. Then we crop and mask the raster to each region and write the sf objects back to shapefiles. After reading in each new shapefile as a maptools object, we create rpoint distributions for each adm1.
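Conceptually, placing households according to a raster is weighted sampling of raster cells. Here is a minimal pure-Python sketch of that idea (toy values, not the spatstat `rpoint` implementation used in the report):

```python
import random

def sample_cells(raster, n, seed=0):
    """Sample n (row, col) cell indices with probability proportional
    to the raster cell values (e.g., WorldPop population counts)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(raster)) for c in range(len(raster[r]))]
    weights = [raster[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=n)

# Toy 2x2 "raster": most of the population in the bottom-right cell.
toy = [[0, 10],
       [5, 85]]
pts = sample_cells(toy, 1000)
```

Cells with zero population are never selected, and dense cells attract proportionally more points, which is what preserves the urban/rural structure discussed below.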

*Figure 2:* `Rpoint`'s distributed according to the raster of each `adm1`

Now that we have distributed the correct number of households in each adm1 according to the underlying raster distribution, we can stack the latitude and longitude values as rows until we have complete longitude and latitude columns for the adm0 object. This is accomplished with the following command; the first 10 rows are output:

all_houses <- rbind.data.frame(as.data.frame(hhohho_houses),
                               as.data.frame(manzini_houses),
                               as.data.frame(shiselweni_houses),
                               as.data.frame(lubombo_houses))
output_all_houses <- all_houses  %>% head(10) 
output_all_houses

After converting to geometries, here is the improved rpoint distribution according to adm1s, recolored to ensure the points are properly located in their corresponding adm1’s as indicated by the DHS household variable hv024.

*Figure 3:* `Rpoint` distributed according to `adm1` rasters for entire country

Since the recolored rpoints remain within the confines of their adm1’s, it appears the method of redistributing points at each adm1 level and then stacking the resulting locations for the entire country worked. While this approach of counting the households in each adm1, cropping, masking, and randomly distributing points worked fine with just four adm1’s, a different method, likely using a for-loop, would be necessary at the adm2 level for Eswatini or for a country composed of many more adm1’s.

Ignoring the recoloring in Figure 3, Figures 1 and 3 look identical to the naked eye. This is likely because we have so few adm1’s that there is little difference between distributing households according to their designated adm1 and distributing them at the national level, since both abide by the underlying raster distribution.

The natural question that arises is, does this mimic reality? Is it really better to distribute the houses this way instead of randomly? To compare, here is a random rpoint distribution at the national level.

*Figure 4:* `Rpoint` distributed with equal probability of being anywhere

The first thing to notice is that the separation between urban and rural is absent. In the last project we wanted to look at de facto settlements based on the population density in different regions, and here those de facto settlements are obliterated. Even producing a plot using the DHS labels for each adm1 is not much better, as seen here.

*Figure 5:* `Rpoint` distributed with equal probability of being anywhere while ensuring each `adm1` has the correct number of houses

As seen in Figure 5, while the number of households in each adm1 now corresponds to the survey data, the de facto settlements are still missing and any information about how people settle near each other is gone. Distributing houses using both the survey data about the number of houses in each adm1 and the raster distribution from WorldPop therefore provides a better-than-random approximation of where the individuals from the sample live. The outcome in Figure 3 thus falls, in accuracy, somewhere between randomly distributing people with an equal-probability distribution function and knowing the exact longitude and latitude of each sample household.

Imputing Missing Data

One of the most difficult challenges to overcome when using survey data is the zero cell problem. One solution is to simply delete the observations that contain NA, but this throws out valuable observations in cases where only a few values are missing. Another solution is to impute the missing data, as shown here for the DHS sample data. In this case we use missForest, a random forest imputation method that works for both qualitative and quantitative data.

# We want to do this using parallel processing to speed it up; we can use a max of 3 cores for our three variables
registerDoParallel(cores=3)

# Now we do the actual imputation here with 20 trees per random forest
swz_pns_impute_sample <- missForest(xmis = as.matrix(swz_pns_preprocess_sample), maxiter = 5, ntree = 20, parallelize = "forests")

# ximp corresponds to our dataframe
swz_pns_complete_sample <- as.data.frame(swz_pns_impute_sample$ximp)

# Put it back together
swz_pns_sample <- swz_pns_sample %>%
  dplyr::select(-c(gender, age, education)) %>%
  bind_cols(swz_pns_complete_sample) %>%
  mutate(education = round(education), age = round(age))

Read more about missForest here.
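missForest iteratively refits random forests over the incomplete columns. As a much simpler conceptual baseline (not the method used in this report), mode imputation fills each missing value with the column's most frequent observed value; a pure-Python sketch:

```python
from collections import Counter

def mode_impute(column):
    """Replace None entries with the most frequent observed value
    in the column (a crude baseline compared to missForest)."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

filled = mode_impute([1, None, 1, 2, None])
# -> [1, 1, 1, 2, 1]
```

Unlike this baseline, missForest conditions each fill on the other variables, which is why it is preferred here for mixed qualitative/quantitative data.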

Later, we will use the same data imputation method for our population-based shell data.

Generating Shell Data

According to the DHS survey documentation, the average household size in Eswatini in 2006 was 4.6 persons per household. Summing all of the cells in the raster data for the entire nation, we get an approximation for the population of Eswatini in 2006. Dividing through by 4.6 yields an approximation for the total number of households in all of Eswatini. We distribute the approximate number of households using rpoint and the underlying raster data for the entire nation and resample with replacement to generate our shell household data.

hhs_pts <- rpoint(swz_household_n, f = as.im(swaziland_raster_pop06), win = win)
hhs_locs <- cbind.data.frame(x = hhs_pts$x, y = hhs_pts$y)
hhs_pop <- dplyr::slice_sample(dhs_vars, n = swz_household_n, replace = TRUE)
hhs_pop_output <- hhs_pop  %>% head(10) 
hhs_pop_output
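The household-count arithmetic behind swz_household_n (the raster population total divided by the 4.6 persons-per-household average from the DHS documentation) can be sketched as follows; the population total here is invented:

```python
def estimate_households(population_total, avg_household_size=4.6):
    """Approximate household count from a population total,
    e.g., the sum of all WorldPop raster cells, using the 4.6
    persons/household average from the 2006 DHS documentation."""
    return round(population_total / avg_household_size)

n_households = estimate_households(1_012_000)  # invented population total
# -> 220000
```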

hhs_pop is a new data frame with 108 variables (only the ones of interest that we identified in the first section) and 220204 observations. This should approximately correspond to the sum of the sample weights divided by 10e5.
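A sketch of that check, with invented weight values; the scaling the report uses, division by 10e5 (which equals 10^6), is kept:

```python
def weighted_total(weights, scale=10e5):
    """Sum of survey weights divided by the scaling factor
    (10e5 == 1e6, the divisor used for the hv005 weights here)."""
    return sum(weights) / scale

approx_households = weighted_total([1_000_000, 2_000_000, 1_500_000])
# -> 4.5
```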

As before with the sample data, we can pivot our variables, paying special attention to education, gender, and age, since they are enumerated values that differ for each household member. We find that our number of rows, 220204, matches the number of sampled households as determined by the weights, 2.1987618 × 10^5, quite closely.

Next we create a data frame with the pivoted shell data, where each row corresponds to one individual. To check the results and calculate the error, we count the number of rows and compare it both to the sum of the weights divided by 10e5 and to the population determined by summing every cell in the raster.
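Both comparisons reduce to a percent-error computation; for clarity, here it is with generic numbers (not the report's):

```python
def percent_error(observed, expected):
    """Absolute percent error of an observed count against a reference."""
    return abs(observed - expected) / expected * 100

err = percent_error(103, 100)
# approximately 3.0
```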

The error of the shell data relative to the count the weights generate is 3.3826267%, and relative to the population the underlying raster predicts it is 0.5173773%. Even though the DHS warns against ignoring the weights, for this simple exercise we chose not to apply them and instead used the WorldPop raster data. It therefore makes sense that the error is greater against the weights than against the raster, since we used the raster data to generate locations for, and sample from, our sample population. In a future analysis, it would be worthwhile to include these weights when extrapolating to the general population to create a synthetic population.

While randomly sampling with replacement should select a synthetic population that correlates with reality, we can verify this by looking at each variable of interest and comparing the proportions of counts for each value.

The last step for the shell data before we can validate is to impute the data as we did before for the sample data. Again, we choose to use missForest to replace the empty cells. At this point we have created two data frames for individuals based on the DHS household data.

The first is our complete pivoted, imputed sample data and the second is our complete pivoted, imputed shell data.

To show our shell data correlates to our sample data, we can look at the education variable and compare proportions of observations.

*Figure 6:* Compare proportions for education sample data and shell data

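The comparison in Figure 6 boils down to computing per-value proportions in each data frame; a pure-Python sketch with invented education codes:

```python
from collections import Counter

def proportions(values):
    """Share of observations taking each value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}

sample_props = proportions([0, 0, 1, 2])        # invented sample codes
shell_props  = proportions([0, 0, 0, 1, 1, 2])  # invented shell counterpart
# sample_props -> {0: 0.5, 1: 0.25, 2: 0.25}
```

Plotting the two dictionaries side by side per value produces the kind of paired-bar comparison shown in Figure 6.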
The bars match, which shows our generated shell data correlates well with the DHS sample data. We can therefore save the final shell data to a csv file and create a Python neural network as well as a multinomial logistic regression model in R.

Models

Using the following two strategies, we want to predict the education variable for the entire population.

1) R Multinomial Logistic Regression

We start by partitioning our data according to a 70/30 training test split.

index <- as.numeric(createDataPartition(swz_pns_sample$education, p = .70, list = FALSE))
train <- swz_pns_sample[index,]
test <- swz_pns_sample[-index,]

The index to split on is determined by the function createDataPartition.

Then we generate our multinomial logistic regression from our training data, providing the size of the household (hv009) and the gender and age of each individual from the sample data. The “correct” output that we train against is given by the education column.

multinom_model <- multinom(education ~ hv009 + gender + age, data = train)

We compute values for the test data and find the resulting accuracy.

test$eduPredicted <- predict(multinom_model, newdata = test, "class")
accuracy <- ((test$eduPredicted == test$education) %>% sum()) / nrow(test) * 100

Our model is 41.0480349% accurate at predicting our sample test data. Before we attempt to predict education for the entire population, let’s save our sample data and shell data to csv files and try a neural network in Python.

One final note: when I first performed this regression, I used the shell data, which makes less sense because those values are extrapolated from the sample data. More specifically, I was trying to train a model on data that already carried a higher level of error.

2) Python Neural Network with Keras

Here is the Python script to generate a neural network that takes number of household members, age, and gender as inputs and predicts education. We create a Sequential model with three layers and the indicated number of inputs to each layer. We train over 100 epochs (the number of passes through the dataset during which the model weights are recomputed) with a batch size of 10 (the number of observations we consider before readjusting the weights).

from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
import csv

# split into input (X) and output (y) variables
X = []
y = []

# load the dataset and distribute to X and y
with open('keras/swz_pns_sample.csv') as csv_file:
  csv_reader = csv.reader(csv_file, delimiter=',')
  for row in csv_reader:
    X_list = []
    y_list = []
    X_list.append(int(row[2]))
    X_list.append(int(row[5]))
    X_list.append(int(row[6]))
    y_list.append(int(row[7]))
    X.append(X_list)
    y.append(y_list)

# 70-30 train test split
len_of_observations = len(X)
print("Number of observations: "+str(len_of_observations))
index = round(len_of_observations*0.7)
print("Index to split on: "+str(index)) 
X_train = X[:index]
X_test = X[index:]
y_train = y[:index]
y_test = y[index:]

# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=3, activation='relu')) # First layer has 12 nodes and gets 3 inputs (ReLU)
model.add(Dense(8, activation='relu')) # Second layer has 8 nodes (ReLU)
model.add(Dense(1, activation='sigmoid')) # Third layer has one node (sigmoid)


# compile the keras model with a metric to assess accuracy after each epoch
# (note: binary_crossentropy with a sigmoid output assumes a 0/1 target)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit our model using epochs and batches
# Batch size is the number of dataset rows that we consider before updating the model weights
# Epochs = number of times we go through the entire dataset
model.fit(X_train, y_train, epochs=100, batch_size=10) # 100 epochs and batch size of 10

# Evaluate the keras model on test data
_, accuracy = model.evaluate(X_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))

# Make class predictions with the model
predictions = model.predict_classes(X_test)

# Print predictions for the first 3 test observations
for i in range(3):
  print("Input data: "+str(X_test[i]))
  print("Predicted education: "+str(predictions[i]))
  print("Actual education: "+str(y_test[i]))

Here is a partial output:

Number of observations: 22143
Index to split on: 15500

Epoch 1/100
1550/1550 [==============================] - 1s 683us/step - loss: -122.3098 - accuracy: 0.4230
Epoch 2/100
1550/1550 [==============================] - 1s 687us/step - loss: -1586.8834 - accuracy: 0.4265
Epoch 3/100
1550/1550 [==============================] - 1s 685us/step - loss: -5573.6860 - accuracy: 0.4265
...
Epoch 100/100
1550/1550 [==============================] - 1s 751us/step - loss: -81860544.0000 - accuracy: 0.4265

Accuracy: 45.40

Input data: [8, 2, 12]
Predicted education: [1]
Actual education: [1]
Input data: [8, 2, 20]
Predicted education: [1]
Actual education: [2]
Input data: [5, 1, 14]
Predicted education: [1]
Actual education: [1]

Unfortunately, this new model is not much more accurate than the previous Multinomial Logistic Regression. The increasingly negative loss values in the training log are also a red flag: binary cross-entropy with a sigmoid output assumes a 0/1 target, which the multi-class education variable is not. Perhaps random forests or some other machine learning model would perform better, or adjustments to this model, such as a softmax output layer or transfer learning, would improve accuracy, but those strategies could be investigated in a future report.
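A proper multi-class cross-entropy is never negative, which is one way to see that the loss/activation pairing in the log is mismatched. Here is a pure-Python sketch of the softmax-plus-cross-entropy combination a multi-class education target calls for (illustrative only, not a patch to the Keras script):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Sparse categorical cross-entropy: -log(probability of true class).
    Always >= 0, unlike the negative losses in the training log above."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])  # invented logits for 3 education classes
loss = cross_entropy(probs, 0)
```

In Keras terms, this corresponds to a softmax output layer with one node per education level and a sparse categorical cross-entropy loss.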

Summary

I decided to only focus on my adm0 because the population of Eswatini is relatively small and the implementation here was mostly computationally inexpensive. As we discussed in class, the accuracy likely decreases as you filter down to an adm1 because we are trying to create models based on less data.

Future investigation could focus on incorporating the weights according to the DHS specifications and trying a different machine learning model such as random forests. Also, since we currently reselect the same households with replacement to build out our shell data, it would be interesting to investigate the problem of homogeneous shell data.

Here is my R script.
Here is my Python script.